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Abstract 

A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and 
classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the 
relationship with the general inverse Gaussian distribution. This approach is mathematically elegant 
and relatively computationally straightforward. Our novel mixture modelling approach is demonstrated 
on both simulated and real data to illustrate clustering and classification applications. In these analyses, 
our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the 
popular Gaussian approach. This work, which marks an important step in the non-Gaussian model- 
based clustering direction, concludes with discussion and suggestions for future work. 
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1 Introduction 

Finite mixture models are based on the underlying assumption that a population is a convex combination of 
a finite number of densities. They therefore lend themselves quite naturally to classification and clustering 
problems. Formally, a random vector X arises from a parametric finite mixture distribution if, for all x C X, 
we can write its density as 

G 

/(x | 0) = 5> ff / g (x | 

3=1 

where ir g > 0, such that Ylg=i ""g = 1 are the mixing proportions, /i(x | 9 ),...,/g(x | 8 g ) are called 
component densities, and i9 = (n, Q\, . . . , Og) is the vector of parameters with 7r = (tti, . . . , ttg)- The com- 
ponent densities /i(x | 8\), . . . , /g(x | Oq) are usually taken to be of the same type, most often multivariate 
Gaussian. In the event that the component densities are multivariate Gaussian, the density of the mixture 
model is /(x | ■&) = 53 s =i 7r s0( x I Mgj^s)) where 0(x | /u g ,S ff ) is the multivariate Gaussian density with 
mean fi g and covariance matrix S s . The popularity of the multivariate Gaussian distribution is due to 
its mathematical tractability and its flexibility in capturing densities; we will return to this latter point in 



Section 4.1 Herein, we shall follow convention and use the term model-based clustering to mean cluster- 



ing using mixture models. Model-based classification (e.g., McNicholas 2010), or partial classification (cf. 
McLachlan 1992 Section 2.7), can be regarded as a semi-supervised version of model-based clustering. 
At the time of the review paper of Fraley and Raftery (2002), almost all work on clustering and classifi- 



cation using mixture models had been based on Gaussian mixture models. This includes work by 


Banfield 


and Raftery 


(1993 


), Celeux and Govaert 


(1995), 


Ghahramani and Hinton 


(1997 


)■ 


Tipping and Bishop 


(1997 



ample of non-Gaussian work from this time is the early work on clustering using mixtures of multivariate 



t-distributions carried out by McLachlan and Peel (1998) and Peel and McLachlan (20001. This work was 
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the forerunner to several papers on clustering using mixtures of multivariate i-distributions, including those 



by 


McLachlan et al. 


(2007), 


Andrews and McNicholas 


, (2011 


), and Baek and McLachlan 


(2011 


). Work 


has also burgeoned on skew- normal distributions (e.g., 


jin 


2009 


), skew-t distributions (e.g., 


Lin 


2010 


Lee 



and McLachlan 2011 Vrbik and McNicholas 2012), and other non-elliptically contoured distributions (e.g. 



Karlis and Meligkotsidou 2007 Karlis and Santourian| 2009 Browne et al. 20121 



The recent burgeoning of non-Gaussian approaches to model-based clustering and classification has coin- 
cided with yet more papers on Gaussian approaches. These include work on extensions of mixtures of factor 



analyzers (McNicholas and Murphy 2008 2010 Baek et al 



2010) and developments on variable selection 


( Raftery and Dean 


2006 


Maugis et al. 


2009 



Scrucca 20101. Nevertheless, the fecundity of non-Gaussian approaches has certainly been more potent than 
that of Gaussian approaches over the last few years. This paper introduces a non-Gaussian approach that 
allows for skewness while also parameterizing location and scale. Our approach is effective while also being 
mathematically elegant and relatively computationally straightforward. The methodology is developed in 
Section [2j parameter estimation via deterministic annealing and an expectation-maximization algorithm is 
outlined in Section [3] and both simulated and real data analyses are used to illustrate our approach in 
Section |2J The paper concludes with a summary and suggestions for future work (Section [5]) . 



2 Methodology 

2.1 Generalized Inverse Gaussian Distribution 



The density of a random variable X following a generalized inverse Gaussian (GIG) distribution is given by 

(1) 



, , (a/ft)"/ 2 !"- 1 
q(x) = == — exp 



ax + b/x 
2 



for x > 0, where a, b E M + , v € R, and K v is the modified Bcssel function of the third kind with index v. 
There are several special cases of the GIG distribution, such as the gamma distribution (b = 0, v > 0) and 
the inverse Gaussian distribution (y = —1/2). The GIG distribution was introduced by Good (1953) and 



1979 


), and 


J0rgensen 


(1982) 



its statistical properties were laid down by Barndorff-Nielsen and Halgreen (1977), Blaesild (1978), Halgreen 



It has some attractive properties including the tractability of the following 



expected values, 



ELY] 



VbK v+1 (Vab 



faK„ 



E[l/X] 



sfaK v+1 (Vabj 2 u 
VbK„ ( Vab^ b 



(2) 



2.2 Shifted Asymmetric Laplace Distribution 

Consider a p-dimensional random variable Z from a centralized asymmetric Laplace (CAL) distribution 
(Kotz et al. 2001). The density of Z is given by 



f( , v , _ 2K„ (u) ( z'X-'z 
A 1 ' ' (2^)p/2|S|i/2 l v2 + a'E- 1 a 



v/2 



expjz'S 1 ct\, 



(3) 



where v = (2 — p)/2, u = \J {2 + a'S 1 a) (z'S 1 z), £ is the covariance matrix, and a € W represents 
the skewness in each dimension. Kotz et aT] ( |2001[ ) use the notation Z ^ AC p (a, S) to indicate that the 
random variable Z follows a p-dimensional CAL distribution. 

The CAL density ^ is prohibitive for model-based clustering and classification applications because it 
would force each component density to be joined at the same origin. To address this problem, consider 



2 



Z ^ ACp (a, 53) and introduce a shift parameter /i e R p by considering a random vector X. = (Z + fi) ^ 
SA£ p (a, 53, p), where <SAC p (a:, 53, /^) denotes a p-dimensional shifted (non-centralized) asymmetric Laplace 
(SAL) distribution with density given by 

f { \ V s ! 2cxp{(x- At yS- 1 Q } / <S(x, M |S) V 7 ' , v ^ 



where it = ^/(2 + ct'53 1 a)(S (x, | 53), 5 (x, fi | 53) = (x — fx)' 53 (x — /it) is the squared Mahalanobis dis- 
tance between x and /x, and z/, a, and 53 are defined as before. 



Kotz et al. (2001 1 note that the random variable Z ^ AC P (a, 53) can be generated through the relation- 
ship Z = Wa + \/WY, where W is a random variable from an exponential distribution with mean 1 and 
Y ^ A/"(0, 53) is generated independently of Therefore, the random variable X ^ <S_4£ p (a:, 53, /i) can be 
simulated through the relationship X = fj, + Wa + y/WY and soX| W — w ^ Af (fx + wa, «;53). 

The distribution of W conditioned on the data can be computed through the use of Bayes' theorem, 
fw(w | X = x) = / x (x | W = w)h(w)/fjc(x), where x | W = w — Af(fi + wa, iu53), W ^ exp(l), and /x(x) 
is the density of the shifted asymmetric Laplace distribution given in Q . It follows that 



f w {w | X = x) = 



5(x )M | 53) V" /2 exp{-^<J(x,/i | 53) - f (2 + a'S- 1 a)} 



2 + a'lT 1 a) T , ( fZT, 77, T^7\ (5) 



K v ^(2 + a'-£- 1 a)S{ X , f i | 53)J 



where z/, a, fi, 53, and 5 (x, Li | 53) are as defined for Q. Recalling the density of a GIG random variable ([I]), 
it then follows from (|5| that fw(w | X = x) is the GIG density with a = 2 + a'53 _1 a and b = <5(x, \x | 53). 

We introduce a finite mixture of SAL distributions so that the gth component density is SAC p (a g , 53 g , fx g ), 
where the parameters are as defined for Q. The density of a mixture of SAL distributions is simply 
/(x | ■&) = Y^=i w g£{ x I a gi Eg, Mg), where i9 is the vector of all model parameters and £(x | a g , 53 g , /x g ) is 
the SAL density from Q. 

3 Parameter Estimation 
3.1 EM Algorithm 



The expectation-maximization (EM) algorithm (Dempster et al. 1977) is an iterative procedure for finding 



the maximum likelihood estimates when data are incomplete or are treated as such. EM algorithm compu- 
tations are based on the complete-data likelihood, i.e., the likelihood of the observed data plus the latent 
or missing data. In the E-step, the expected value of the complete-data log-likelihood is computed; in the 
M-step, this value is then maximized with respect to the model parameters. The E- and M-steps are then 
iterated until some convergence criterion is attained. 

A common criticism of the use of the EM algorithm for model-based clustering is that the singularity- 
riddled likelihood surface makes parameter estimation unreliable and heavily dependent on the starting 



values. To help overcome this problem, Zhou and Lange (2010) introduced a deterministic annealing algo 



rithm that flattens the likelihood surface by introducing an auxiliary variable v € [0, 1]. We will illustrate this 
approach by using a version of this deterministic annealing algorithm in conjunction with an EM algorithm 



to fit our SAL mixture model (cf. Section 3.3) 



3.2 Application to Mixture of SAL Distributions 

For our SAL mixture models, the complete-data comprise the observed Xi, . . . ,x„, the component mem- 
bership labels ti,...t„, and the variable W. For each i, we have Tj = (th , . . . , tig) where Ti g = 1 if 
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observation i is in component g and Ti g = otherwise, for i = 1, . . . , n and g = 1, . . . , G. The complete-data 
likelihood is given by 

n G 

£ = Yl H ( x ' ; I ^5 + w i a gi w i 'S g ) h(wi)} T ' 3 , (6) 
i=i fl =l 

with the same notation as previously and where </> (xj | fi g + iUja: fl , ?DjX! a j is the density of a multivariate 
Gaussian distribution with mean fj, g +Wia g and covariance matrix WiTi g . The expected-value of the complete- 
data log-likelihood is given by 

G n G n G 

Q = n g logTT, - *f log 2n - ^ £ logE \W t | xj + £ ^ log (S^ 1 1 + ^ E ^ ( x * " M,)' E ^ 

n G n G 

- 2 EE f * ( x « - *v>' E tv^i i Xi ] s; 1 (x 4 - m s ) - 2 E E ^ E i x *i "s 5 ^ 

i=l g—1 i—1 g—1 

n G 

i=l g=l 

where n g = £)£=i ^9 an d 

V,., i~;i( x - I '>:,.£.,.//,) 

The expected values E [Wi | Xj] and E [1/Wj | x,] are computed using the formulae in (|2j). 

In the M-step, we maximize Q with respect to the model parameters to get the updates. Specifically, the 
mixing proportions, skewness parameter, and shift parameter are updated by jr g — n g /n, 



and 



n \ / 7i \ / ti 

E f i9 E I W I X »] ) ( E f iB*i ) - n 9 ( E ^ E [V^i I X,] X 
t=l ' ^ i=l ' ^ »=1 

- { ( E ^ E i x *i ) ( E ^ E [vw< i x. ( 
^ ^ i=i ' ^ »=i 

a s = { ( E f ^ E i x <] ) ( E ^ E [vw< i x,] x ^ - % (e ^ x <) } 

- { ( E ^ E [Wi I x,] ) ( E ^ E [vw< I x, 

^ \ i=l ' ^ i=l 



n 2 



n 2 

"■g 



respectively. Each component covariance matrix S g is then updated by 

1 " 

S g = S g - a 9 r 9 - r g a g H a g a g f ig E [Wj | Xj] , (8) 

n s i=l 

where S g = £" =1 E [1/W< | x,] (x* - p, g ) (x, - /Lt ? )' and r g = (l/n g ) £" =1 ( x ; - £«,)• 

At convergence, the fi g are the a posteriori probabilities of component membership for each observation 
and can be used to cluster the observations into groups. Predicted classifications are obtained via maximum a 
posteriori (MAP) probabilities, where MAP{fi S } = 1 if max s {fi S } occurs at component g and MAP{fi g } = 
otherwise. 
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3.3 Initialization 



The deterministic annealing algorithm (Zhou and Lange 2010) is the same as the EM algorithm described 
in Section pP] except that now 



E [T ig | Xj] 



[tt^(x, | a g ,-E g ,fi g )] v 



(9) 



in each E-step. Here, the auxiliary parameter v € [0, 1], which is drawn from an increasing sequence of user- 
specified length, transforms the likelihood surface to improve the chances of finding the dominant mode. 
The length of the user-specified sequence, which runs from to 1, determines how many iterations of the 
deterministic annealing algorithm will be preformed. The annealing algorithm itself is initialized using 
random starting values of a g , S g , and fi g . In our analyses (Section El), we run the deterministic annealing 
algorithm ten times and choose the values that give the highest likelihood as the starting values for our EM 
algorithms. 



3.4 Convergence 
3.4.1 Aitken acceleration 



The Aitken acceleration (Aitken 1926) is used to determine convergence of our EM algorithms. An EM 



algorithm can be considered to have converged when — i^+i) <- £j w here /C £ + 1 ) is the log-likelihood at 

iteration k + 1 and 

/(fc+l) ;(fe) 
,(fc+l) _ ,(fe) , l _ ' 

00 + 1-oW ' 



is the asymptotic log- likelihood at iteration fc + l (cf. Bbhning et al. 1994). The value 
is the Aitken acceleration at iteration k. For the analyses here in, we use a slightly mo dified convergence 



criterion, stopping our EM algorithms when — < e (cf. McNicholas et al. 2010) 



3.4.2 Dealing with infinite likelihood 

As our algorithm iterates, we must handle the complications that arise when computing fi g . As the EM 
algorithm iterates toward convergence, the value of jj, g will tend to an observation x^ due to the term 
[5(x, /it | S)/(2 + a'S _1 a)] !/ ' /2 in the multivariate SAL density Q. Although these estimates of (i g maximize 
the likelihood, they create computational issues when trying to determine the remaining parameter values 
and, specifically, the expected value E[l/Wj | x,]. 

To overcome this problem, we stop searching for [i g when fi g has the same value as some Xj because at 
this point the log-likelihood becomes infinite. We proceed by taking the value of fi g at the iteration before 
it becomes equal to any x, (we denote this value jx*) as the MLE of fi g . We then compute the MLE of a g 
using 

* * _ Si=l fig ( x ' ~ Mg) 

[Y;?= 1 n g E[W i \x i ]_ ' 

and compute S g given the update in (11). We use a different approach to overcome this problem in the 
deterministic annealing algorithm, simply restricting the expected value of E[l/Wj | Xj] from exceeding a 
value of — log(l — v) at each iteration. We acknowledge that our solution to this problem is a simple-minded 
one. However, we have found it to be quite effective and a more thorough exploration of the problem is the 
subject of ongoing work. 
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3.5 Model-based classification 

Suppose we have n observations and k of these observations have known group memberships. We can 
use the group memberships of these k observations to estimate memberships for the remaining n — k ob- 
servations within a joint likelihood framework. This approach, known as model-based classification, is a 
semi-supervised version of model-based clustering. Without loss of generality, we order the observations 
xi, . . . , Xfc, Xfc + i, . . . , x„ so that the first k have known group memberships. Therefore, the values of Ti g are 
known for i = 1, . . . , k and the SAL model-based classification likelihood is given by 

k G n J 

C (Xi, . . . , X„,Ti, . . . ,T G | OL g , S g , fl g ) 

where J > G. Parameter estimation is carried out in an analogous fashion to model-based clustering. 



3.6 Model Selection and Performance 



The Bayesian information criterion (BIC; Schwarz] 1978) is commonly used for Gaussian mixture model 
selection: BIC = 2Z(x | i?) — vlogn, where Z(x | i?) is the maximized log-likelihood, i? is the maximum 
likelihood estimate of v is the number of free parameters in the model, and n is the number of observations. 
In the analyses herein, we prefer to use a model selection approach that is specifically designed for clustering 
and classification applications. To this end, we use the integrated classification likelihood (ICL; |Biernacki 
et al. 2000), which is calculated using 



ICL w BIC + E MAP ^ lo S^' 



(11) 



i=fc+l g=l 



where X)"=fe+i Sg=i MAP{5i 9 } log Zi 9 , the estimated mean entropy, reflects the uncertainty in the classifi- 
cation of observation i into group g. 

To assess clustering and classification performance, we use a cross tabulation of our MAP classifications 



against the true group memberships. We then compute the adjusted Rand index (ARI; Hubert and Arabie 



1985 ). The Rand index ( Rand 1971 ) was introduced to compare partitions; it is the ratio of pairs that should 



be and are together plus pairs that should be and are apart, divided by the total number of pairs. The Rand 
index takes a value between and 1, where 1 indicates perfect agreement. An unattractive feature of the 
Rand index is that it has a positive expected value under random classification. To correct this undesirable 
property, Hubert and Arabie ( |1985[ ) introduced the ARI to account for chance agreement. The ARI also 
takes a value of 1 when classification is perfect but has an expected value of under random classification. 
The ARI can also take negative values; this happens for classifications that are worse than would be expected 
by chance. 



4 Data Analyses 
4.1 Introduction 



The SAL mixture models are applied to simulated data (Section 4.2 ) and to two real data sets; the famous Old 



Faithful geyser data (Section 4.3) and data on cellular localization sites for proteins in yeast (Section 4.4). 
The simulation study was conducted to assess the accuracy of our SAL parameter estimates, including 
selection of the number of components, and to draw comparison with the Gaussian mixture model approach. 
In the real analyses, we illustrate the SAL approach, judging its performance alongside that of Gaussian 
mixtures. 



G 



Parameter 1 



Figure 1: Example of a simulated data set (n = 500), coloured by component. 



On the face of it, a comparison of our SAL approach to Gaussian mixtures might be considered a little 
empty because one would expect Gaussian mixtures to use more than one component to model a cluster 
with skewness. Indeed, there has been a lot of work within the literature on merging Gaussian components 



(Baudry et al. 2010 Hennig 2010). However, our examples illustrate that when predicted classifications 
differ for the SAL and Gaussian approaches, merging Gaussian components cannot always be used to rectify 
shortcomings in the Gaussian results (cf. Sections 4.2 and 4.4). One may also argue that the mixture 
of multivariate skew-£ distributions could also be used to accommodate skewness. This is true, but two 



different representations of the mixture of skew-i distributions have appeared within the literature (cf. Sahu 



et al. , 2003 Pyne et al. 2009 ) and neither have the elegance of SAL mixtures; this is most apparent in 



the intractability of the E-step for the skew-i approach (cf. Lin 2010 Lee and McLachlan 2011| Vrbik and 



McNicholas[|2012 |. 

Note that, for all analyses, we use the same deterministic annealing starting values for the mixtures of 
Gaussian distributions as for our SAL mixtures. Further, we treat each analysis as a genuine clustering 
problem by removing all labels; we then apply SAL and Gaussian mixture models. 



4.2 Simulation Study 

We used the relationship between the SAL and Gaussian distributions (cf. Section |2.2[ ) to generate mul- 
tivariate SAL data; specifically, we simulated 25 data sets for n = 500 with p = 2 dimensions and G = 2 
components (e.g., Figure]!]). The data were generated using skewness parameters e*i = (2, 1) and a.2 = (2, 2), 
shift parameters \x x = (0, —2) and fi 2 = (0, 5), and covariance matrices, 



Si = 



1 

0.5 



0.5 



and 



S 2 = 



1 
1 



The clustering results (Table [IJ show that the SAL mixtures give almost perfect clustering performance 
over all 25 runs (average ARI=0.9968). The Gaussian approach, however, gave relatively poor clustering 
performance (average ARI=0. 49880) and never returned the correct number of clusters. The most common 
Gaussian solution (chosen 21 times) has five components and an example is given in Figure [2j each of the 
other four solutions had four components. 

The five-component Gaussian solution depicted in Figure [2] is typical of 21 of the cases observed in 
this simulation. Particularly striking here is that the components cannot be combined to give the correct 
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Table 1: Summary for the analysis of the 25 simulated data sets using our SAL mixtures, and using the 
Gaussian approach. 



SAL 


G = 2 selected 
Average ARI (std. dev.) 


100% 

0.9968025 (0.005157439) 


Gaussian 


G = 2 selected 
Average ARI (std. dev.) 


0% 

0.4987959 (0.06052465) 




Figure 2: One of the G — 5 Gaussian solutions for the simulated data, coloured by MAP classification 
results. 



component structures. Instead, there is a clear 'noise' group comprising points that do not fit neatly within 
any of the other four components. Although one might argue that this problem is obvious by inspection in 
two dimensions, and might be resolved in some post hoc fashion, it would be virtually impossible to detect 
or resolve in all but very low dimensional examples. 



4.3 Old Faithful Geyser Data 



The famous Old Faithful geyser data, which are available as faithful in R (R Development Core Team 



2012), is a two- variable data set measuring the waiting time between eruptions and the duration of 272 
eruptions of the Old Faithful geyser in Yellowstone National Park. These data are well-known as an example 



of skewness and they have been used many times to illustrate approaches to analyzing skewed data (e.g., Ali 
et al. 2010). Our mixture of SAL distributions and mixtures of Gaussian distributions were fitted to the 



Geyser data. There are no 'true' classifications for these data but both approaches selected sensible groups 
with identical classifications. The associated contour plots (Figure [3]) illustrate that the fit to the data is 
better for the the mixture of SAL distributions. Note that applying a mixture of skew-i distributions to 



these data will also give nicely fitting contours (cf. Vrbik and McNicholas 2012[ ). However, as discussed in 
Section |4.1[ the mixture of SAL distributions is less computationally cumbersome. 
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SAL 



GMM 




-1.5 -1 .O -0.5 O.O 0.5 1.0 1.5 -1.5 -1.0 -0.5 O.O 0.5 1.0 1.5 



Figure 3: Model-based clustering results with contours for the SAL and Gaussian mixture models on the 
Old Faithful data. 



4.4 Yeast Data 



4.4.1 The Data 

The yeast data, which are available from the UCI machine learning repository, contain cellular localization 
sites of 1,484 proteins. The development of these data, as well as classification results using a 'rule-based 
expert system', is discussed by Nakai and Kanehisa ( 1991 1992 ). For illustration, we consider three variables: 



McGeoch's method for signal sequence recognition (meg), the score of the ALOM membrane spanning region 
prediction program (aim) , and the score of discriminant analysis of the amino acid content of vacuolar and 
extracellular proteins (vac). The goal of our cluster analysis is to distinguish between the two localization 
sites, CYT (cytosolic or cytoskeletal) and ME3 (membrane protein, no N-terminal signal), for these data 
(cf. Figure E|. 




Figure 4: The yeast data with the CYT and ME3 location sites highlighted. 
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4.4.2 Model-Based Clustering Results 



The SAL and Gaussian mixture models were fit to these data for G = 1, . . . , 5 groups and the best fitting 
model was chosen using the ICL. The chosen SAL mixture model had two components (ICL=— 5173.133) and 
the chosen Gaussian mixture model had three components (ICL=— 5161.878). The classification performance 
of the SAL mixture model (ARI=0.81) is superior to that of the Gaussian mixture model (ARI=0.56), as 
illustrated in Table [2] and Figure [5] Again, this superiority cannot be negated by combining Gaussian 
components. 

Table 2: Clustering results for the best SAL and Gaussian mixture model, as chosen by ICL, for the yeast 
data; the two localization sites are cross-tabulated against our predicted classifications (A, B, C) in each 
case. 





SAL 






GMM 






A 


B 


A 


B 


C 


CYT 


448 


15 


379 


12 


72 


ME3 


14 


149 


13 


11 


139 




Figure 5: Clustering results for the SAL and Gaussian mixture models on the yeast data; the colours highlight 
the predicted component memberships. 

Now, one may argue that the Gaussian mixture model would perform better under a different model 
selection criterion. Therefore, we also investigated the Gaussian mixture model with G = 2 (Table [3]). 
Surprisingly, this Gaussian mixture model gives poor clustering performance, producing classifications that 
are worse than would be expected by guessing (i.e., ARI=— 0.088 < 0). 

Table 3: Classification results for the two component Gaussian mixture model on the yeast data; the two 
localization sites are cross-tabulated against our predicted classifications (A, B). 





A 


B 


CYT 


106 


357 


ME3 


1 


162 



4.4.3 Model-Based Classification Results 

To compare model-based classification within the SAL and Gaussian mixture modelling frameworks, we 
analyze the yeast data with 70% of the group memberships taken to be known. We set G — 2 and each 
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model was fit with 25 different random 70/30 partitions of the data. The aggregate classification results 
(Table [4J and ARIs (0.86 and —0.080, respectively) indicate that the SAL mixture models outperform their 
Gaussian counterparts by some margin. 

Table 4: Aggregate classification results for the SAL and Gaussian mixture models for the yeast data 
with 70% of the labels taken as known; the two localization sites are cross-tabulated against our predicted 
classifications (A, B) for the observations with unknown labels in each case. 







SAL 






GMM 




A 




B 


A 


B 


CYT 


3403 




71 


964 


2502 


ME3 


87 




1139 


29 


1205 



The poor performance of the Gaussian mixture modelling approach here is surprising because the pa- 
rameter estimates are computed with 70% of the location sites taken as known. This result raises questions 
around the efficacy of the semi-supervised Gaussian model-based classification approach, as well as further 
reinforces the need for more flexible non-Gaussian approaches. Once again, the poor performance of the 
Gaussian approach on these data cannot be mitigated by merging components. 



5 Summary and Discussion 

A mixture of SAL distributions model was introduced and applied for both clustering and classification. 
An EM algorithm is used for parameter estimation, with starting values obtained using the deterministic 



annealing approach of Zhou and Lange (2010). To account for both parsimony and entropy, the ICL is used 
to select the number of mixture components. Our model-based clustering approach was illustrated on both 
real and simulated data. 

In the simulation, data were generated from a SAL mixture model with two components that were very 
well separated. The SAL mixtures gave near-perfect results on these data whereas the Gaussian mixture 
models consistently overestimated the number of components. Furthermore, it is particularly notable that 
the overestimation of the number of components by the Gaussian mixture models cannot be resolved by 
merging components. 

We also analyzed two real data sets. For the Old Faithful data, we considered only model-based clustering 
with the SAL and Gaussian mixture models and both gave the same predicted group memberships. However, 
a contour plot revealed that the SAL mixture model captured the shape of the data far more effectively 
than its Gaussian counterpart. The yeast protein location data presented a much more difficult clustering 
problem, and so we also used these data to illustrate model-based classification. For model-based clustering, 
the chosen SAL model gave very good clustering performance (G = 2, ARI=0.81) and outperformed its 
Gaussian counterpart (G = 3, ARI=0.56). Further, when we forced G = 2 components, the Gaussian 
mixture modelling approach gave worse clustering performance than would be expect by guessing (ARI< 0). 
In the model-based classification applications, we set G = 2 and considered 25 random subsets of the data 
for which 70% of the locations were taken as known. The SAL mixtures again gave excellent performance 
(ARI=0.86) but the Gaussian mixtures had negative ARIs; it is surprising that the Gaussian approach gave 
very poor classification performance in the semi-supervised case when 70% of the locations were taken as 
known. This result calls into question the efficacy of the Gaussian model-based classification approach. 
Furthermore, as with the simulation study, the poor performance of Gaussian mixtures on the yeast data 
cannot be mitigated by merging components. 



A decade on from the landmark paper of Fraley and Raftery (2002), we have put forth a case for 



substantial departure from the Gaussian model-based clustering paradigm. Unlike the skew-normal and 
skew-i approaches, which are perhaps less significant departures, our approach is elegant and computationally 
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straightforward. This paper reinforces the fact that the literature is moving away from Gaussian approaches 
with some alacrity but it should not be taken as being pejorative of Gaussian mixture models. Gaussian 
mixture models remain a highly effective method for clustering and classifying certain types of data, and will 
forever be the midwife that introduced mixture model-based clustering. They can, however, give misleading 
results and, as we have illustrated, one cannot rely on rectifying poor performance by merging components. 

As mentioned in Section |3.4.2| a better solution to the issue of infinite log-likelihood values in our EM 
algorithm is the subject of ongoing work. Future work will focus on the introduction of parsimony into 
these models by decomposing the covariance structure in the same way as described by |Celeux and Govaert 
(1995) and Fraley and Raftery (2002) for the Gaussian mixture model (S g = A s D s A g D g ). Analysis of very 
high-dimensional data with our HAL mixtures will be explored along the lines of work by [McNichola s~and| 



Murphy (2008 2010) and Baek et al. (2010), who focused on extensions of mixtures of factor analyzers in 



the Gaussian case. Work will also be conducted on model selection to compare SAL mixtures with other 
mixtures, such as mixtures of i-distributions, with a special focus on clustering applications. 
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